The Performance Paradox states that even a mathematically perfect kernel (e.g., $out = x + y$) can end up slower than a CPU loop if it fails to amortize the GPU's fixed hardware costs. This typically shows up as the launch tax.
1. The Myth of "Correctness"
Functional correctness is not the same as efficiency. Your Triton code may correctly distribute work across thousands of threads, but if the total workload (N) is too small, the GPU remains underutilized: the hardware spends far more time on state transitions than on actual computation.
2. The Python Measurement Trap
Benchmarking GPU code with Python's time.time() is risky. GPU calls are asynchronous; Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you are measuring enqueue time. With synchronization added, you are measuring host-to-device latency, which is often as much as ten times longer than the kernel's execution time.
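The pitfall can be demonstrated without a GPU. The sketch below uses a hypothetical `FakeDevice` class (a toy stand-in built on threads, not any real CUDA API) to mimic asynchronous launches: submitting work returns immediately, and only `synchronize()` blocks until it finishes. Real code would call `torch.cuda.synchronize()` instead.

```python
import threading
import time

class FakeDevice:
    """Toy stand-in for an asynchronous accelerator: launching work
    returns immediately; synchronize() blocks until the queue drains."""
    def __init__(self):
        self._threads = []

    def launch(self, seconds):
        # Enqueue the "kernel" and return right away, like a CUDA launch.
        t = threading.Thread(target=time.sleep, args=(seconds,))
        t.start()
        self._threads.append(t)

    def synchronize(self):
        # Block the host until all queued work has completed.
        for t in self._threads:
            t.join()
        self._threads.clear()

dev = FakeDevice()

# Naive timing: the clock stops after enqueueing, not after execution.
t0 = time.perf_counter()
dev.launch(0.05)            # pretend this kernel runs for 50 ms
naive = time.perf_counter() - t0

# Correct timing: wait for the device to finish before stopping the clock.
t0 = time.perf_counter()
dev.launch(0.05)
dev.synchronize()
synced = time.perf_counter() - t0

print(f"naive:  {naive * 1e3:.2f} ms")   # far less than 50 ms
print(f"synced: {synced * 1e3:.2f} ms")  # at least ~50 ms
```

The naive number looks impressively small because it only measures how long the host took to hand off the command; the synchronized number reflects the work actually done.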
3. Latency vs. Throughput
To overcome this paradox, you must supply enough work to hide the launch latency. This is precisely the shift from a latency-bound regime (dominated by the CPU-GPU interface) to a throughput-bound regime (limited by GPU memory or compute capacity).
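This shift can be made concrete with a back-of-the-envelope model. The numbers below are illustrative assumptions (a ~40 µs fixed launch cost and 1.5 TB/s of memory bandwidth), not measurements from any particular GPU; the point is only how the fixed cost shrinks relative to useful work as N grows.

```python
# Assumed figures for illustration only; real values vary by GPU and driver.
LAUNCH_OVERHEAD_US = 40.0
BANDWIDTH_BYTES_PER_US = 1.5e12 / 1e6  # 1.5 TB/s expressed in bytes per µs

def vector_add_time_us(n, dtype_bytes=4):
    """Model total time for out = x + y: a fixed launch cost plus the time
    to stream 3 arrays (2 loads + 1 store) through memory. Arithmetic
    intensity is 1 FLOP per 12 bytes moved, so memory traffic, not math,
    sets the kernel's duration."""
    traffic = 3 * n * dtype_bytes
    return LAUNCH_OVERHEAD_US + traffic / BANDWIDTH_BYTES_PER_US

for n in (256, 10**8):
    total = vector_add_time_us(n)
    overhead_pct = 100.0 * LAUNCH_OVERHEAD_US / total
    print(f"N={n:>9}: total ≈ {total:8.1f} µs, launch tax ≈ {overhead_pct:5.1f}%")
```

At N=256 the launch tax is essentially the entire runtime (latency-bound); at N=10^8 it falls to a few percent and memory bandwidth dominates (throughput-bound).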
QUESTION 1
For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).
N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch
N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic
N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch
All are compute-bound.
✅ Correct!
At very small N, launch overhead dominates. Large vector adds are memory-bandwidth limited. Dense matrix multiplications have high arithmetic intensity and become compute-bound.
❌ Incorrect
Think about the ratio of math to data movement, and the constant cost of starting a kernel.
QUESTION 2
In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
L1 Cache Size
✅ Correct!
ReLU is memory-bound. It performs one very simple comparison (max(0, x)) for every load and store, resulting in extremely low arithmetic intensity.
❌ Incorrect
Does ReLU perform complex math, or does it spend most of its time moving data to and from HBM?
QUESTION 3
What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?
The GPU and CPU always finish at the same time.
The CPU continues to the next line of code before the GPU kernel finishes.
The kernel runs faster on smaller GPUs.
Memory transfers are blocked by compute.
✅ Correct!
This is why synchronization is required for accurate timing; otherwise, you just time how long it took to send the command.
❌ Incorrect
If the CPU waited for every GPU call, performance would be significantly worse due to constant idle cycles.
QUESTION 4
Why does $out = x + y$ exhibit low arithmetic intensity?
It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.
The addition operation is too complex for the ALUs.
It requires shared memory synchronization.
It only runs on one SM.
✅ Correct!
High-performance compute requires many FLOPs per byte moved. Vector add is the opposite, making it bandwidth-limited.
❌ Incorrect
Count the number of times you access memory (tl.load/tl.store) versus the number of math operations (+).
QUESTION 5
How can the 'Launch Tax' be amortized in a real-world application?
By calling the kernel more frequently with smaller data.
By increasing the workload per launch (e.g., larger N or batching).
By using 16-bit floats instead of 32-bit floats.
By disabling the L2 cache.
✅ Correct!
Increasing the workload makes the fixed overhead a smaller percentage of the total execution time.
❌ Incorrect
Smaller data sizes actually make the launch tax more prominent relative to the useful work.
Case Study: The Overhead Audit
Interpreting Host vs. Device Benchmarks
A developer runs a Triton kernel for Vector Addition on 512 elements. They measure 45 microseconds using Python's `time.time()`. When profiling the same kernel using NVIDIA Nsight Systems, the actual GPU duration is reported as only 2.1 microseconds.
Q
1. What is the approximate 'Launch Tax' in microseconds for this scenario, and what percentage of the total measured time does it represent?
Solution:
The Launch Tax is approximately 42.9 microseconds (45 µs total − 2.1 µs of GPU work). This represents ~95.3% of the total measured time, indicating the application is heavily bound by system overhead rather than computation.
Q
2. If the developer increases N to 1,000,000 elements, assuming the kernel now takes 150 microseconds on the GPU, how does the Launch Tax impact the overall efficiency?
Solution:
With a constant launch overhead of ~43 µs, the total time would be ~193 µs. The overhead now accounts for only ~22.3% of the time. Efficiency improves as N increases because the fixed cost is spread over a much larger volume of compute/memory work.
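Both answers follow from the same subtraction; a few lines of Python make the arithmetic explicit (the 150 µs kernel time for the second scenario is the figure given in the question, not a measurement):

```python
def launch_tax(total_us, kernel_us):
    """Split a host-side measurement into GPU work and fixed overhead,
    returning the overhead and its share of the total."""
    overhead = total_us - kernel_us
    return overhead, 100.0 * overhead / total_us

# Scenario 1: N = 512; host timer reads 45 µs, Nsight reports 2.1 µs on GPU.
overhead, pct = launch_tax(45.0, 2.1)
print(f"launch tax ≈ {overhead:.1f} µs ({pct:.1f}% of total)")  # ≈ 42.9 µs, 95.3%

# Scenario 2: N = 1,000,000; same overhead, but the kernel now takes 150 µs.
total = overhead + 150.0
print(f"overhead share ≈ {100.0 * overhead / total:.1f}% of {total:.1f} µs")
```

The fixed cost is identical in both scenarios; only its share of the total changes, which is the entire content of the amortization argument.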